Introduction and Motivation

The five leading causes of death in the United States from 1999 to 2014 are cancer, heart disease, unintentional injury, chronic lower respiratory disease, and stroke. The dataset includes the U.S. Department of Health and Human Services public health regions. Therefore, we can investigate the leading causes of death of each region, and develop accordingly public health policy and remedies.

Data Descriptions

Approaches

Visualizations

Conclusion

Notes

cod_data = read_csv("./data/NCHS_-_Potentially_Excess_Deaths_from_the_Five_Leading_Causes_of_Death.csv") %>%
  clean_names() %>%
  na.omit() %>%
  filter(!(state == "United States")) %>%
  separate(., percent_potentially_excess_deaths, into = c("percent_excess_death"), sep = "%") %>% 
  mutate(percent_excess_death = as.numeric(percent_excess_death), mortality = observed_deaths/population * 10000, mortality = as.numeric(mortality)) %>% 
  select(year, age_range, cause_of_death, state, locality, observed_deaths, population, expected_deaths, potentially_excess_deaths, percent_excess_death, mortality, hhs_region)
## Parsed with column specification:
## cols(
##   Year = col_integer(),
##   `Cause of Death` = col_character(),
##   State = col_character(),
##   `State FIPS Code` = col_character(),
##   `HHS Region` = col_integer(),
##   `Age Range` = col_character(),
##   Benchmark = col_character(),
##   Locality = col_character(),
##   `Observed Deaths` = col_integer(),
##   Population = col_integer(),
##   `Expected Deaths` = col_integer(),
##   `Potentially Excess Deaths` = col_integer(),
##   `Percent Potentially Excess Deaths` = col_character()
## )
## Warning: Too many values at 191748 locations: 1, 2, 3, 4, 5, 6, 7, 8, 9,
## 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...
##columns removed

 #"state_fips_code"                      "benchmark"   "potentially_excess_deaths" "percent_excess_death"      "mortality"   

xs2329:

cod_data %>%
  group_by(cause_of_death) %>% 
  ggplot(aes(x = locality, y = mortality, fill = cause_of_death)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title="Locality vs. Mortality") +
  labs(x="locality", y="mortality") 

This bar graph shows the distribution of cause of death within mortality in the three geographic regions: Metropolitan, Nonmetropolitan and All. We observe that the weight of each composition of cuases of death is the same regardless of the locality. Nonmetropolitan area seems to have the highest mortaility, the number of deathes observed in every 10000 people.

cod_data %>%
  group_by(cause_of_death) %>% 
  ggplot(aes(x = locality, y = mortality)) + geom_boxplot(aes(color = cause_of_death), na.rm = T) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title="Locality vs. Mortality") +
  labs(x="locality", y="mortality") 

We observe that the median of mortaility is the highest among all 5 causes of death, followed by heart disease, unintentional injury, heart disease, chronic lowee respiratory disease and stroke. This pattern remains consistant when we subdivide locaility into metropolitan region and nonmetrpolitan region, which is identical to our previous finding. All 5 cuases of death are right skewed regardless of the locality.

mortality_lm = lm(mortality ~locality, data = cod_data)
summary(mortality_lm)
## 
## Call:
## lm(formula = mortality ~ locality, data = cod_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1454 -2.9547 -1.2983  0.9988 19.8959 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.10331    0.01672 245.474   <2e-16 ***
## localityMetropolitan    -0.22576    0.02372  -9.519   <2e-16 ***
## localityNonmetropolitan  1.14451    0.02438  46.942   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.317 on 191745 degrees of freedom
## Multiple R-squared:  0.01821,    Adjusted R-squared:  0.0182 
## F-statistic:  1778 on 2 and 191745 DF,  p-value: < 2.2e-16

interpretation on linear regression:

Here we obtain the linear regression model of mortality = 4.103 - 0.226localityMetropolitan -1.145localityNonmetropolitan. We expect to see 0.22576 less deaths for every 10000 people in the metropolitan region as compared to all regions. We expect to see 1.14451 more death for every 10000 people in the nonmetropolitan region as compared to all regions.

cod_data %>%
  mutate(year = as.factor(year))
## # A tibble: 191,748 x 12
##      year age_range cause_of_death   state        locality observed_deaths
##    <fctr>     <chr>          <chr>   <chr>           <chr>           <int>
##  1   2005      0-49         Cancer Alabama             All             756
##  2   2005      0-49         Cancer Alabama    Metropolitan             556
##  3   2005      0-49         Cancer Alabama Nonmetropolitan             200
##  4   2005      0-49         Cancer Alabama             All             756
##  5   2005      0-49         Cancer Alabama    Metropolitan             556
##  6   2005      0-49         Cancer Alabama Nonmetropolitan             200
##  7   2005      0-49         Cancer Alabama             All             756
##  8   2005      0-49         Cancer Alabama    Metropolitan             556
##  9   2005      0-49         Cancer Alabama Nonmetropolitan             200
## 10   2005      0-54         Cancer Alabama             All            1346
## # ... with 191,738 more rows, and 6 more variables: population <int>,
## #   expected_deaths <int>, potentially_excess_deaths <int>,
## #   percent_excess_death <dbl>, mortality <dbl>, hhs_region <int>
p <- cod_data %>%
  plot_ly(
    x = ~expected_deaths, 
    y = ~observed_deaths, 
    size = ~population, 
    color = ~cause_of_death, 
    frame = ~hhs_region, 
    text = ~state, 
    hoverinfo = "text",
    type = 'scatter',
    mode = 'markers'
  ) 
p
### Question of concern: If percent excess death has significant difference between metro and non-metro groups.
gp_cod_data = cod_data %>%
  group_by(year, locality) %>% 
  summarise(mean_ped = mean(percent_excess_death))

ggplot(gp_cod_data, aes(x = year, y = mean_ped, fill = locality)) + geom_point(stat = "identity") + geom_smooth() +
facet_grid(. ~ locality) 
## `geom_smooth()` using method = 'loess'

gp_locality_lm = lm(mean_ped ~locality , data = gp_cod_data)
summary(gp_locality_lm)
## 
## Call:
## lm(formula = mean_ped ~ locality, data = gp_cod_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5667 -1.0325 -0.3470  0.7922  3.2937 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              33.8866     0.4071  83.238  < 2e-16 ***
## localityMetropolitan     -2.0599     0.5757  -3.578   0.0012 ** 
## localityNonmetropolitan   8.0460     0.5757  13.975 1.13e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.35 on 30 degrees of freedom
## Multiple R-squared:  0.9198, Adjusted R-squared:  0.9145 
## F-statistic: 172.1 on 2 and 30 DF,  p-value: < 2.2e-16
broom::tidy(gp_locality_lm)
##                      term  estimate std.error statistic      p.value
## 1             (Intercept) 33.886633 0.4071070 83.237648 4.783207e-37
## 2    localityMetropolitan -2.059931 0.5757363 -3.577906 1.200041e-03
## 3 localityNonmetropolitan  8.045979 0.5757363 13.975111 1.132457e-14

The regression model is mean percent_excess_death = 33.89 - 2.05*Metropolitan + 8.046Nonmetropolitan. For people in metropolitan, the expected percent excess death has a decrease of 2.06. For people in non-metropolitan aream the expected percent excess death has an increase of 8.046. The p-values show these predictors are both significant. The adjusted R squared is 0.91, which indicates the regression model explains 91% of variation in mean percent excess death due to the variation in locality.

# How should we help nonmetropolitan areas improve public health? 
cod_datayr2 = read_csv("./data/NCHS_-_Potentially_Excess_Deaths_from_the_Five_Leading_Causes_of_Death.csv") %>%
  clean_names() %>%
  na.omit() %>%
  filter(!(state == "United States")) %>%
  separate(., percent_potentially_excess_deaths, into = c("percent_excess_death"), sep = "%") %>%
  mutate(percent_excess_death = as.numeric(percent_excess_death), mortality = observed_deaths/population * 10000, mortality = as.numeric(mortality)) %>%
  select(year, cause_of_death, state, benchmark, locality, observed_deaths, population, expected_deaths, potentially_excess_deaths, percent_excess_death, mortality)
## Parsed with column specification:
## cols(
##   Year = col_integer(),
##   `Cause of Death` = col_character(),
##   State = col_character(),
##   `State FIPS Code` = col_character(),
##   `HHS Region` = col_integer(),
##   `Age Range` = col_character(),
##   Benchmark = col_character(),
##   Locality = col_character(),
##   `Observed Deaths` = col_integer(),
##   Population = col_integer(),
##   `Expected Deaths` = col_integer(),
##   `Potentially Excess Deaths` = col_integer(),
##   `Percent Potentially Excess Deaths` = col_character()
## )
## Warning: Too many values at 191748 locations: 1, 2, 3, 4, 5, 6, 7, 8, 9,
## 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...

average percent access death for each cause in different locality 2005-2015

cod_datayr2 %>%
  filter(benchmark == "2010 Fixed")%>%
  group_by(cause_of_death,year, locality)%>%
  mutate(year_percent_death=mean(percent_excess_death))%>%
  distinct(year, cause_of_death, locality, .keep_all = TRUE)%>%
  ungroup(year, locality)%>%
ggplot(aes(x = year, y = year_percent_death, color = cause_of_death))+
  geom_point(size = 1)+
  geom_line(size = 1) +
  facet_grid(.~locality)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_x_continuous( breaks=seq(2005,2015,1))+
  labs(title = "Average yearly excess death percentage for different locality")

Interpretation: As can be seen in the plot, if we stratify by cause, we will find some increasing pattern in subgroups(unintentional injury) in certain period of time(2010-2015). In nonmetropolitan regions, unintentional injury and chronic lower respiratory disease have on average the highest excess death percentage in nonmetropolitan areas and needs more attention from public health experts.

cod_datayrlm = cod_datayr2 %>%
  filter(locality == "Nonmetropolitan",
         benchmark == "2010 Fixed")
lmpop = lm(data=cod_datayrlm, percent_excess_death~year+cause_of_death)
summary(lmpop)
## 
## Call:
## lm(formula = percent_excess_death ~ year + cause_of_death, data = cod_datayrlm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.456  -9.828   0.576  10.404  39.489 
## 
## Coefficients:
##                                                  Estimate Std. Error
## (Intercept)                                     769.06335   67.48938
## year                                             -0.36941    0.03358
## cause_of_deathChronic Lower Respiratory Disease  27.06490    0.33901
## cause_of_deathHeart Disease                      13.90257    0.32844
## cause_of_deathStroke                             12.91441    0.33784
## cause_of_deathUnintentional Injury               31.18470    0.32812
##                                                 t value Pr(>|t|)    
## (Intercept)                                       11.39   <2e-16 ***
## year                                             -11.00   <2e-16 ***
## cause_of_deathChronic Lower Respiratory Disease   79.83   <2e-16 ***
## cause_of_deathHeart Disease                       42.33   <2e-16 ***
## cause_of_deathStroke                              38.23   <2e-16 ***
## cause_of_deathUnintentional Injury                95.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.92 on 19718 degrees of freedom
## Multiple R-squared:  0.365,  Adjusted R-squared:  0.3649 
## F-statistic:  2267 on 5 and 19718 DF,  p-value: < 2.2e-16

From the linear model based on 2010 fixed benchmark, we can tell that controlling for cause of death, the excess death percentage is expected to decrease by 0.37 each year. Overall, the excess death percentage shows a decreasing pattern with time. Besides, in the same year, compared with cancer, other 4 causes have significantly higher excess death percentage. Herein unintentional injury and chronic lower respiratory disease have on average the highest excess death percentage.

Relationship between Mean Percentage of Excess Death and Five Leading Cause of Death

For each of the five cause of death, what is the mean percent of excess death? It is different across locality. yz3306: As shown in the boxplots for all three regions, the rank of the five cause of death remain the same for different localities. For all five causes, the mean percent of death is lower for Metropolitan and higher for Nonmetropolotan, with respect to All Locality.

cod_data %>%
  na.omit %>%
  group_by(cause_of_death) %>%
  mutate(mean_percent_excess_death = mean(percent_excess_death)) %>%
  ungroup(cause_of_death) %>%
  mutate(cause_of_death = fct_reorder(cause_of_death, mean_percent_excess_death)) %>%
  ggplot(aes(x = cause_of_death, y = percent_excess_death, fill = cause_of_death)) +
  geom_boxplot() +
  facet_grid(~locality) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))  +
  theme(legend.position = "bottom")

yz3306:Regress percent of excess death on cause of death, which is a categorical variable. with respect to cancer, the least of the five causes, all 4 have additional percent of death as evaluated by the estimates. The results give statisically significant estimates .

cod_lm_yz1 = lm(percent_excess_death ~cause_of_death, data = cod_data)
broom::tidy(cod_lm_yz1)
##                                              term estimate  std.error
## 1                                     (Intercept) 21.60686 0.08399078
## 2 cause_of_deathChronic Lower Respiratory Disease 21.08880 0.12091802
## 3                     cause_of_deathHeart Disease 10.89773 0.11879905
## 4                            cause_of_deathStroke 12.92196 0.12046778
## 5              cause_of_deathUnintentional Injury 25.77100 0.11875825
##   statistic p.value
## 1 257.25277       0
## 2 174.40579       0
## 3  91.73246       0
## 4 107.26483       0
## 5 217.00389       0
#PLOTLY
map_cod_data = cod_data %>%
  filter(locality == "Metropolitan") %>%
  select(state, locality, percent_excess_death) %>% 
  group_by(state) %>% 
  summarise(mean_ped = mean(percent_excess_death)) %>% 
  dplyr::filter(!(state == "District of\nColumbia"))
  

map = as.tibble(fifty_states) %>%
  group_by(id) %>% 
  summarize(clong = mean(long), clat = mean(lat)) %>% 
  filter(!(id == "district of columbia"))

df <- cbind(map, state.abb, state.center, rate = unique(map_cod_data$mean_ped))

ggplot(df, aes(map_id = id)) + 
    geom_map(aes(fill = rate), map = fifty_states) + 
    expand_limits(x = fifty_states$long, y = fifty_states$lat) + 
    labs(x = "", y = "") +
    theme(panel.background = element_blank(), 
          axis.text.x = element_blank(), 
          axis.text.y = element_blank(), 
          axis.ticks = element_blank()) + 
    geom_text(aes(x = clong, y = clat, label = state.abb)) +
  scale_fill_gradient(low="gold", high="red")

region_cod_data = cod_data %>%
  select(state, locality, hhs_region, percent_excess_death) %>% 
  group_by(state,locality, hhs_region) %>% 
  summarise(mean_ped = mean(percent_excess_death)) %>% 
  dplyr::filter(!(state == "District of\nColumbia")) %>% 
  mutate(hhs_region = as.character(hhs_region))


region_lm = lm(region_cod_data$mean_ped~region_cod_data$hhs_region)
rl1 = broom::tidy(region_lm)
kable(rl1)
term estimate std.error statistic p.value
(Intercept) 22.897478 2.124541 10.777612 0.0000000
region_cod_data$hhs_region10 6.919893 3.302733 2.095202 0.0379936
region_cod_data$hhs_region2 -1.678888 4.456475 -0.376730 0.7069571
region_cod_data$hhs_region3 15.028044 3.161418 4.753577 0.0000050
region_cod_data$hhs_region4 27.459220 2.776844 9.888645 0.0000000
region_cod_data$hhs_region5 11.065534 2.962531 3.735162 0.0002743
region_cod_data$hhs_region6 23.938689 3.103091 7.714466 0.0000000
region_cod_data$hhs_region7 12.500192 3.302733 3.784802 0.0002292
region_cod_data$hhs_region8 5.138616 2.962531 1.734536 0.0850722
region_cod_data$hhs_region9 8.502314 3.302733 2.574327 0.0111054
region_locality_lm =lm(region_cod_data$mean_ped ~ region_cod_data$hhs_region+region_cod_data$locality)
rl2 = broom::tidy(region_locality_lm)
kable(rl2)
term estimate std.error statistic p.value
(Intercept) 21.518530 2.080719 10.3418699 0.0000000
region_cod_data$hhs_region10 6.592074 2.933587 2.2471039 0.0262563
region_cod_data$hhs_region2 -0.892122 3.959828 -0.2252931 0.8220919
region_cod_data$hhs_region3 15.098291 2.807614 5.3776243 0.0000003
region_cod_data$hhs_region4 27.131401 2.466649 10.9992936 0.0000000
region_cod_data$hhs_region5 10.737714 2.631517 4.0804272 0.0000765
region_cod_data$hhs_region6 23.610869 2.756320 8.5660831 0.0000000
region_cod_data$hhs_region7 12.172373 2.933587 4.1493141 0.0000587
region_cod_data$hhs_region8 4.810797 2.631517 1.8281458 0.0697354
region_cod_data$hhs_region9 8.174495 2.933587 2.7865190 0.0060956
region_cod_data$localityMetropolitan -2.159388 1.555863 -1.3879042 0.1674526
region_cod_data$localityNonmetropolitan 7.279690 1.582741 4.5994187 0.0000096

Interpretation: U.S. Department of Health and Human Services public health regions (1 through 10) are used as a categorical variable in the above regressions. Specific region classification is shown in the figure below.

[attach image here].

As shown in the above summary tables, with respect to region 1, only region 2 have negative estimated coefficient, indicating less mean percentage of excess death . Specifically, comparing with region 1, region 2 have 1.67% less mean percentage of excess death on average. From region 3 to region 10, the mean percentage of excess death is higher comparing with region 1. Similiar results yield from regression adjusted for Locality. Adjusting locality, region 4 and 6 have top two highest increase in mean percetage of excess death with respect to region 1. Adjusting for different regions, similar result yield that, on average, metropolitan have 2% less than overall mean percetage of excess deaths and nonmetropolitan have 7% more than overall mean percetage of excess death.